Abstract

In this paper we develop a representation of a class of feedforward neural networks in
terms of discrete affine wavelet transforms. It is shown that by appropriate grouping of
terms, feedforward neural networks with sigmoidal activation functions can be viewed as
architectures which implement affine wavelet decompositions of mappings. This result follows
simply from the observation that standard feedforward network architectures possess an
inherent translation-dilation structure and every node implements the same activation
function. It is shown that the wavelet transform formalism provides a mathematical framework
within which it is possible to perform both analysis and synthesis of feedforward networks.
For the purposes of analysis, the wavelet formulation characterizes a class ($L^2(\mathbb{R})$) of
mappings which can be implemented by feedforward networks as well as reveals the exact
implementation of a given mapping in this class. Spatio-spectral localization properties of
wavelets can be exploited in synthesizing a feedforward network to perform a given
approximation task. Synthesis procedures based on spatio-spectral localization result in reducing
the training problem to one of convex optimization. We outline two such synthesis schemes.


This research was supported in part by the National Science Foundation’s Engineering Research Centers
Program: NSFD CDR 8803012, the Air Force Office of Scientific Research under contract AFOSR-88-0204 and
by the Naval Research Laboratory.

*Also  with  Nanoelectronics  Processing  Facility,  Code  6804,  Naval  Research  Laboratories,  Washington  D.C. 
in choosing the particular set of ‘basis’ functions which are used to implement the transform. In
the case of discrete affine wavelet transforms, which we discuss in Section 3, the ‘basis’ functions
are generated by translating and dilating a single function.

In Section 4 we demonstrate that affine wavelet decompositions of functions can be implemented
within the standard architecture of feedforward neural networks. Sigmoidal functions
have  traditionally  been  used  as  ‘activation’  functions  of  nodes  in  a  neural  network.  Section  4.1 
is  concerned  with  constructing  a  wavelet  ‘basis’  using  combinations  of  sigmoids.  For  simplicity, 
we  restrict  discussion  to  networks  designed  to  learn  one-dimensional  maps.  One  of  the  main 
results  of  this  paper  is  Theorem  4.1.  In  Section  4.2  we  briefly  describe  extensions  of  these  results 
to  higher  dimensions. 

In  Section  5  we  outline  two  schemes  in  which  spatio-spectral  localization  properties  of 
wavelets  are  used  to  formulate  synthesis  procedures  for  feedforward  neural  networks.  It  is 
shown  that  such  synthesis  procedures  can  result  in  systematic  definition  of  network  topology 
and  simplified  network  ‘training’  problems.  Most  of  the  weights  in  the  network  are  determined 
via  the  synthesis  process  and  the  remaining  weights  may  be  obtained  as  a  solution  to  a  convex 
optimization problem. Since the resulting optimization problem is one of least squares
approximation, the remaining weights can also be determined by solving the associated ‘normal
equations’.

A few simple numerical simulations of the methods of this paper are provided in Section 5.4.

2  Functional  Approximation  and  Neural  Networks 

This  section  provides  a  brief  introduction  to  the  application  of  feedforward  neural  networks  to 
functional  approximation  problems. 

Let $\Theta$ be a set containing pairs of sampled inputs and the corresponding outputs generated
by an unknown map $f : \mathbb{R}^m \to \mathbb{R}^n$, $m, n < \infty$, i.e.
$\Theta = \{(x^i, y^i) : y^i = f(x^i);\ x^i \in \mathbb{R}^m,\ y^i \in \mathbb{R}^n,\ i = 1, \dots, K,\ K < \infty\}$.
We call $\Theta$ the training set. Note that the samples in $\Theta$ need not
be uniformly distributed. In this context, the task of functional approximation is to use the
data provided in $\Theta$ to ‘learn’ (approximate) the map $f$. Many existing schemes to perform
this task are based on parametrically fitting a particular functional form to the given data.
Simple examples of such schemes are those which attempt to fit linear models or polynomials of
fixed degree to the data in $\Theta$. More recently, nonlinear feedforward neural networks have been
applied to the task of ‘learning’ the map $f$. In the interest of keeping this paper self-contained,
an overview of the neural network approach is given below.

2.1  Feedforward  Neural  Networks 

The basic component in a feedforward neural network is the single ‘neuron’ model depicted in
Figure 2(a), where $u_1, \dots, u_n$ are the inputs to the neuron, $k_1, \dots, k_n$ are multiplicative weights
applied to the inputs, $I$ is a biasing input, $g : \mathbb{R} \to \mathbb{R}$, and $y$ is the output of the neuron. Thus
$y = g\left(\sum_{i=1}^{n} k_i u_i + I\right)$. The ‘neuron’ of Figure 2(a) is often depicted as shown in Figure 2(b) where
the input weights, bias, summation, and function $g$ are implicit. Traditionally, the activation
function $g$ has been chosen to be the sigmoidal nonlinearity shown in Figure 3. This choice of $g$
Figure 2: (a) Single neuron model, (b) Simplified schematic of single neuron


was  initially  based  upon  the  observed  firing  rate  response  of  biological  neurons.  A  feedforward 
neural  network  is  constructed  by  interconnecting  a  number  of  neurons  (such  as  the  one  shown 
in  Figure  2)  so  as  to  form  a  network  in  which  all  connections  are  made  in  the  forward  direction 
(from  input  to  output  without  feedback  loops)  as  in  Figure  4.  Neural  networks  of  this  form  are 
usually  comprised  of  an  input  layer,  a  number  of  hidden  layers,  and  an  output  layer.  The  input 
layer  consists  of  neurons  which  accept  external  inputs  to  the  network.  Inputs  and  outputs  of 
the  hidden  layers  are  internal  to  the  network,  and  hence  the  term  ‘hidden’.  Outputs  of  neurons 
in  the  output  layer  are  the  external  outputs  of  the  network.  Once  the  structure  of  a  feedforward 
network has been decided, i.e. the number of hidden layers and the number of nodes in each
hidden layer has been set, a mapping is ‘learned’ by varying the connection weights, $w_{ij}$’s, and
the biases, $I_j$’s, so as to obtain the desired input-output response for the network.$^1$

One  method  often  used  to  vary  the  weights  and  biases  is  known  as  the  backpropagation 
algorithm  in  which  the  weights  and  biases  are  modified  so  as  to  minimize  a  cost  functional  of 
the  form, 

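$$E = \sum_{(x^i, y^i) \in \Theta} \| O^i - y^i \|^2, \tag{1}$$

where $O^i$ is the output vector (at the output layer) of the network when $x^i$ is applied at the
input. Backpropagation employs gradient descent to minimize $E$. That is, the weights and
biases are varied in accordance with the rules,

$$\Delta w_{ij} \propto -\frac{\partial E}{\partial w_{ij}} \quad \text{and} \quad \Delta I_j \propto -\frac{\partial E}{\partial I_j}.$$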
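The gradient-descent rules can be illustrated with a minimal sketch. It assumes a squared-error cost, a scalar input and output, a single hidden layer of logistic sigmoids, and a fixed step size `eta`; the function names and the step size are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def network(x, w_in, bias, w_out):
    """Output of a one-hidden-layer network: sum_j w_out[j] * s(w_in[j]*x + bias[j])."""
    return sum(wo * sigmoid(wi * x + b) for wi, b, wo in zip(w_in, bias, w_out))

def cost(data, w_in, bias, w_out):
    """Squared-error cost E summed over the training pairs (x, y)."""
    return sum((network(x, w_in, bias, w_out) - y) ** 2 for x, y in data)

def gradient_step(data, w_in, bias, w_out, eta=0.05):
    """One gradient-descent update: each parameter moves along -dE/d(parameter)."""
    n = len(w_in)
    g_in, g_b, g_out = [0.0] * n, [0.0] * n, [0.0] * n
    for x, y in data:
        err = 2.0 * (network(x, w_in, bias, w_out) - y)  # dE/dO for this sample
        for j in range(n):
            h = sigmoid(w_in[j] * x + bias[j])
            dh = h * (1.0 - h)                           # derivative of the sigmoid
            g_out[j] += err * h
            g_b[j] += err * w_out[j] * dh
            g_in[j] += err * w_out[j] * dh * x
    w_in = [w - eta * g for w, g in zip(w_in, g_in)]
    bias = [b - eta * g for b, g in zip(bias, g_b)]
    w_out = [w - eta * g for w, g in zip(w_out, g_out)]
    return w_in, bias, w_out
```

Repeated calls to `gradient_step` on a training set play the role of the learning loop; nothing in this sketch guards against the local minima discussed later in this section.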


Feedforward neural networks have empirically demonstrated an ability to approximate
complicated maps very well using the technique just described. However, to date

$^1$ We will use $w_{ij}$ to denote the weight applied to the output $O_j$ of the $j$th neuron when connecting it to the
input of the $i$th neuron. $I_j$ is the bias input to the $j$th neuron.
Figure 3: Sigmoidal nonlinearity.

there  does  not  exist  a  satisfactory  theoretical  foundation  for  such  an  approach.  We  feel  that 
a  satisfactory  theoretical  foundation  should  provide  more  than  just  a  proof  that  feedforward 
networks  can  indeed  approximate  certain  classes  of  maps  arbitrarily  well.  Some  of  the  problems 
that  one  should  be  able  to  address  within  a  good  theoretical  setting  are  the  following: 

(1)  Development  of  a  well-founded  systematic  approach  to  choosing  the  number  of  hidden 
layers  and  the  number  of  nodes  in  each  hidden  layer  required  to  achieve  a  given  level  of 
performance  in  a  given  application. 

(2)  Learning  algorithms  often  ignore  much  of  the  information  contained  in  the  training  data, 
and  thereby  overlook  potential  simplification  of  the  weight  setting  problem.  As  we  will 
show  later,  preprocessing  of  training  data  results  in  convexity  of  the  training  problem. 

(3)  An  inability  to  adequately  explain  empirically  observed  phenomena.  For  example,  the  cost 
functional  E  may  possess  many  local  minima  due  to  the  nonlinearities  in  the  network.  A 
gradient  descent  scheme  such  as  backpropagation  is  bound  to  settle  to  such  local  minima. 
However,  in  many  cases,  it  has  been  observed  that  settling  to  a  local  minimum  of  E  does 
not  adversely  affect  overall  performance  of  the  network.  Observations  such  as  this  demand 
a  suitable  explanatory  theoretical  framework. 

The  methods  of  this  paper  offer  a  framework  within  which  it  is  possible  to  address  at  least 
the  first  two  issues  above. 

3 Time-Frequency Localization and Discrete Affine Wavelet Transforms

In  this  section  we  review  some  basic  properties  of  frames  and  discrete  affine  wavelet  transforms. 

We also introduce some definitions to formalize the concept of time-frequency localization. To
avoid confusion, we point out that throughout this paper we will refer to the domain of the
map to be approximated as time or space interchangeably.

Given a separable Hilbert space $\mathcal{H}$, we know that it is possible to find an orthonormal
basis $\{h_n\}$ such that for any $f \in \mathcal{H}$ we can write the Fourier expansion $f = \sum_n a_n h_n$ where


Figure 4: Feedforward neural network (input layer, hidden layers, output layer).

$a_n = \langle f, h_n \rangle$. For example, the trigonometric system $\left\{ \frac{1}{\sqrt{2\pi}} e^{jnt} \right\}$ is an orthonormal basis
for the Hilbert space $L^2[-\pi, \pi]$. The Fourier expansion of a signal with respect to the
trigonometric system is useful in frequency analysis of the signal since each basis element $\frac{1}{\sqrt{2\pi}} e^{jnt}$
is localized in frequency at $\omega = n$. Hence the distribution of coefficients appearing in the
Fourier expansion provides information about the frequency composition of the original signal.
In many applications it is desirable to be able to obtain a representation of a signal which is
localized to a large extent in both time and frequency. The utility of joint time-frequency
localization is easily illustrated by noting that the coefficients in the Fourier expansion of the signal
shown in Figure 5 do not readily reveal the fact that the signal is mostly flat and that high


Figure  5:  Signal  for  which  time-frequency  localized  representations  are  useful 

frequency  components  are  localized  to  a  short  time  interval.  Examples  of  applications  where 
time-frequency  localization  is  desirable  can  be  found  for  instance  in  image  processing  [17]  [16] 
[7]  [23],  and  analysis  of  sound  patterns  [12].  One  method  of  obtaining  such  localization  is  the 
windowed Fourier transform. This involves taking the Fourier transform of a signal in small
time windows which are defined by a window function. Hence the windowed Fourier transform
provides information about the frequency content of a signal over a relatively short interval of
time. Doubly ‘localized’ (well concentrated in both time and frequency) representation is one
of the primary benefits of wavelet decompositions. However, in obtaining such a localized
representation using ‘nice’ ‘basis’ functions, it is sometimes necessary to sacrifice the convenience
of decomposing signals with respect to an orthonormal basis. Instead it becomes necessary to
consider generalizations of orthonormal bases which are called frames.

3.1  Frames  in  Hilbert  Spaces 

Frames,  which  were  first  introduced  by  Duffin  and  Schaeffer  in  [8],  are  natural  generalizations 
of  orthonormal  bases  for  Hilbert  spaces. 

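Definition 3.1 Given a Hilbert space $\mathcal{H}$ and a sequence of vectors $\{h_n\}_{n=-\infty}^{\infty} \subset \mathcal{H}$, $\{h_n\}_{n=-\infty}^{\infty}$
is called a frame if there exist constants $A > 0$ and $B < \infty$ such that

$$A \|f\|^2 \le \sum_n |\langle f, h_n \rangle|^2 \le B \|f\|^2, \tag{2}$$

for every $f \in \mathcal{H}$. $A$ and $B$ are called the frame bounds.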


Remarks: 

(a) A frame $\{h_n\}$ with frame bounds $A = B$ is called a tight frame.

(b) Every orthonormal basis is a tight frame with $A = B = 1$.

(c) A tight frame of unit-norm vectors for which $A = B = 1$ is an orthonormal basis.


Given a frame $\{h_n\}$ in the Hilbert space $\mathcal{H}$, with frame bounds $A$ and $B$, we can define the
frame operator, $S : \mathcal{H} \to \mathcal{H}$, as follows. For any $f \in \mathcal{H}$,

$$Sf = \sum_n \langle f, h_n \rangle h_n. \tag{3}$$
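As a finite-dimensional illustration of the frame inequality (2) and the frame operator (3), the following sketch uses three unit vectors in $\mathbb{R}^2$ spaced 120 degrees apart. This particular frame and the test vector `f` are illustrative assumptions, not examples from the paper; the frame is tight with $A = B = 3/2$, so $Sf = \frac{3}{2} f$.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def frame_operator(frame, f):
    """Apply S f = sum_n <f, h_n> h_n for a finite frame in R^d."""
    out = [0.0] * len(f)
    for h in frame:
        c = inner(f, h)
        out = [o + c * hi for o, hi in zip(out, h)]
    return out

def frame_energy(frame, f):
    """Middle term of the frame inequality: sum_n |<f, h_n>|^2."""
    return sum(inner(f, h) ** 2 for h in frame)

# Three unit vectors in R^2 spaced 120 degrees apart (an illustrative choice).
frame = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)) for k in range(3)]

f = (0.3, -1.2)
ratio = frame_energy(frame, f) / inner(f, f)  # equals A = B for a tight frame
Sf = frame_operator(frame, f)                 # equals A * f for a tight frame
```

The redundancy of the frame (three vectors in a two-dimensional space) shows up as the common bound $3/2$ being larger than the value 1 obtained for an orthonormal basis.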

The  following  theorem  lists  some  properties  of  the  frame  operator  which  we  shall  find  useful. 
Proofs  of  these  and  other  related  properties  of  frames  can  be  found  in  [9]  or  [6]. 

Theorem 3.1

(1) $S$ is a bounded linear operator with $AI \le S \le BI$, where $I$ is the identity operator in $\mathcal{H}$.

(2) $S$ is an invertible operator with $B^{-1} I \le S^{-1} \le A^{-1} I$.

(3) Since $AI \le S \le BI$ implies that $\left\| I - \frac{2}{A+B} S \right\| < 1$, $S^{-1}$ can be computed via the Neumann
series,

$$S^{-1} = \frac{2}{A+B} \sum_{k=0}^{\infty} \left( I - \frac{2}{A+B} S \right)^k. \tag{4}$$

(4) The sequence $\{S^{-1} h_n\}$ is also a frame, called the dual frame, with frame bounds $B^{-1}$ and
$A^{-1}$.

(5) Given any $f \in \mathcal{H}$, $f$ can be decomposed in terms of the frame (or dual frame) elements as

$$f = \sum_n \langle f, S^{-1} h_n \rangle h_n = \sum_n \langle f, h_n \rangle S^{-1} h_n. \tag{5}$$

(6) Given $f \in \mathcal{H}$, if there exists another sequence of coefficients $\{a_n\}$ (other than the sequence
$\{\langle f, S^{-1} h_n \rangle\}$) such that $f = \sum_n a_n h_n$, then the $a_n$’s are related to the coefficients given
in (5) by the formula,

$$\sum_n |a_n|^2 = \sum_n |\langle f, S^{-1} h_n \rangle|^2 + \sum_n |\langle f, S^{-1} h_n \rangle - a_n|^2. \tag{6}$$
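Properties (3), (5) and (6) can be checked numerically on a small example. The sketch below assumes the frame $\{(1,0), (0,1), (1,1)\}$ in $\mathbb{R}^2$, for which $S$ has eigenvalues 1 and 3, so $A = 1$ and $B = 3$. The frame, the vector `f`, the alternative coefficients `alt` and the truncation length `terms` are illustrative assumptions.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_S(frame, f):
    """Frame operator S f = sum_n <f, h_n> h_n."""
    out = [0.0] * len(f)
    for h in frame:
        c = inner(f, h)
        out = [o + c * hi for o, hi in zip(out, h)]
    return out

def apply_S_inverse(frame, g, A, B, terms=60):
    """Approximate S^{-1} g by truncating the Neumann series (4)."""
    c = 2.0 / (A + B)
    term = list(g)                      # (I - cS)^0 g
    total = [0.0] * len(g)
    for _ in range(terms):
        total = [t + x for t, x in zip(total, term)]
        s_term = apply_S(frame, term)
        term = [x - c * s for x, s in zip(term, s_term)]  # next power of (I - cS)
    return [c * t for t in total]

frame = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # frame bounds A = 1, B = 3
A, B = 1.0, 3.0
f = (0.7, -0.4)

dual = [apply_S_inverse(frame, h, A, B) for h in frame]   # dual frame S^{-1} h_n
coeffs = [inner(f, d) for d in dual]                      # <f, S^{-1} h_n>
recon = [sum(c * h[i] for c, h in zip(coeffs, frame)) for i in range(2)]

# Another coefficient sequence that also reproduces f, used to check (6).
alt = [0.7, -0.4, 0.0]
lhs = sum(a * a for a in alt)
rhs = sum(c * c for c in coeffs) + sum((c - a) ** 2 for c, a in zip(coeffs, alt))
```

Here `recon` reproduces `f`, and `lhs` equals `rhs`, which illustrates that the dual-frame coefficients have the smallest sum of squares among all coefficient sequences that represent `f`.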

3.1.1  Definitions  Pertaining  to  Time-Frequency  Localization 

In this paper we shall restrict discussion to the Hilbert space $L^2(\mathbb{R})$, which is the space of all
finite energy signals on the real line, i.e. $f \in L^2(\mathbb{R})$ if and only if

$$\int_{\mathbb{R}} |f(x)|^2 \, dx < \infty.$$

If $f, g \in L^2(\mathbb{R})$ then the inner product $\langle f, g \rangle$ is defined by

$$\langle f, g \rangle = \int_{\mathbb{R}} f(x) \overline{g(x)} \, dx,$$

where $\overline{g}$ denotes the complex conjugate of $g$, and the norm $\| \cdot \|$ on $L^2(\mathbb{R})$ is defined by

$$\|f\|^2 = \langle f, f \rangle.$$

The following definitions are useful in formalizing the concept of time-frequency localization.

Definition 3.2 Given a function $f \in L^2(\mathbb{R})$, $f : \mathbb{R} \to \mathbb{R}$, with Fourier transform $\hat{f}$,

(1) the center of concentration, $x_c(f)$, of $f$ is defined as

$$x_c(f) = \frac{1}{\|f\|^2} \int_{\mathbb{R}} x \, |f(x)|^2 \, dx.$$

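(2) the center of concentration, $\omega_c(|\hat{f}|^2)$, of $|\hat{f}|^2$ (or center frequency of $f$) is defined as

$$\omega_c(|\hat{f}|^2) = \frac{\int_0^{\infty} \omega \, |\hat{f}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{f}(\omega)|^2 \, d\omega}.$$

Note that $\omega_c(|\hat{f}|^2)$ is defined so as to account for the evenness of $|\hat{f}|^2$ for real-valued $f$; so
$\omega_c(|\hat{f}|^2)$ is the positive center frequency of $|\hat{f}|^2$.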
Remark:

The center of concentration $x_c(f)$ can be thought of as the location parameter (in the sense of
statistics) of the density $|f|^2 / \|f\|^2$ on $\mathbb{R}$.

Definition 3.3

The support of a function $f$, denoted supp$(f)$, is the closure of the set $\{x : |f(x)| > 0\}$.

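Definition 3.4 Given $f \in L^2(\mathbb{R})$, $f : \mathbb{R} \to \mathbb{R}$, with Fourier transform $\hat{f}$, and centers of
concentration $x_c(f)$ and $\omega_c(|\hat{f}|^2)$, let

$$\mathcal{P}(f; \epsilon) = \left\{ [x_0, x_1] : |x_c(f) - x_0| = |x_c(f) - x_1| \ \text{and} \ \int_{x \in \mathbb{R} \setminus [x_0, x_1]} |f(x)|^2 \, dx \le \epsilon \|f\|^2 \right\},$$

and,

$$\hat{\mathcal{P}}(f; \epsilon) = \left\{ [\omega_0, \omega_1] : \omega_0 = \max(0, \tilde{\omega}_0), \ |\omega_c(|\hat{f}|^2) - \tilde{\omega}_0| = |\omega_c(|\hat{f}|^2) - \omega_1|, \ \text{and} \ \int_{\omega \in \mathbb{R}^+ \setminus [\omega_0, \omega_1]} |\hat{f}(\omega)|^2 \, d\omega \le \epsilon \|f\|^2 \right\}.$$

(1) The epsilon support (or time concentration) of $f$, denoted $\epsilon\text{-supp}(f, \epsilon)$, is the set
$[x_0(f), x_1(f)] \in \mathcal{P}(f; \epsilon)$ such that,

$$\mu([x_0(f), x_1(f)]) = \inf_{[x_0, x_1] \in \mathcal{P}(f; \epsilon)} \mu([x_0, x_1]).$$

(2) The epsilon support of $|\hat{f}|^2$ (or frequency concentration of $f$), denoted $\epsilon\text{-supp}(|\hat{f}|^2, \epsilon)$, is the
set $[\omega_0(f), \omega_1(f)] \in \hat{\mathcal{P}}(f; \epsilon)$ such that

$$|\omega_1(f) - \omega_0(f)| = \inf_{[\omega_0, \omega_1] \in \hat{\mathcal{P}}(f; \epsilon)} |\omega_1 - \omega_0|.$$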

Remark: 

The $\epsilon$-support of $f$ is the smallest interval (symmetric about $x_c(f)$) containing $(1 - \epsilon)$ times the
total signal energy. We further note that the notion of $\epsilon$-support introduced here is used later
in Section 5 to formulate a synthesis procedure for feedforward neural networks. In particular,
the $\epsilon$-support affects the number of hidden layer nodes needed to achieve a given quality of
function approximation.
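The time-domain $\epsilon$-support can be approximated numerically from samples of $f$. The sketch below is a rough illustration under stated assumptions: integrals are replaced by Riemann sums on a uniform grid, $f$ is assumed negligible outside `[x_min, x_max]`, and the function name and default grid are hypothetical choices.

```python
import math

def epsilon_support(f, eps, x_min=-20.0, x_max=20.0, n=40001):
    """Return (x_c, L) with [x_c - L, x_c + L] approximating eps-supp(f, eps)."""
    dx = (x_max - x_min) / (n - 1)
    xs = [x_min + i * dx for i in range(n)]
    dens = [f(x) ** 2 for x in xs]
    energy = sum(dens) * dx                                  # approximates ||f||^2
    x_c = sum(x * d for x, d in zip(xs, dens)) * dx / energy  # center of concentration
    # Visit grid points in order of distance from x_c, i.e. grow a symmetric
    # interval until at most eps of the energy lies outside it.
    order = sorted(range(n), key=lambda i: abs(xs[i] - x_c))
    inside = 0.0
    for i in order:
        inside += dens[i] * dx
        if inside >= (1.0 - eps) * energy:
            return x_c, abs(xs[i] - x_c)
    return x_c, max(x_max - x_c, x_c - x_min)
```

For the Gaussian `f = lambda x: math.exp(-x * x / 2.0)`, the energy fraction inside $[-L, L]$ is $\mathrm{erf}(L)$, so the half-width returned for `eps = 0.1` can be compared against the solution of $\mathrm{erf}(L) = 0.9$.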

3.2  Discrete  Affine  Wavelet  Transforms 

Given a function $g \in L^2(\mathbb{R})$, consider the sequence of functions $\{g_{mn}\}$ generated by dilating
and translating $g$ in the following manner,

$$g_{mn}(x) = a^{n/2} g(a^n x - mb), \tag{7}$$

where $a > 0$ and $b > 0$ and $m$ and $n$ are integers. Let us assume that $g \in L^2(\mathbb{R})$ is real-valued,
concentrated at zero with sufficient decay away from zero, and that $\epsilon\text{-supp}(g, \epsilon) = [-L, L]$,
where $\epsilon$ is small and chosen such that the energy contribution of $g$ outside $[-L, L]$ is negligible.
In addition, suppose that the Fourier transform $\hat{g}$ of $g$ is compactly supported, with supp$(\hat{g}) =
[-\omega_1, -\omega_0] \cup [\omega_0, \omega_1]$ and concentrated at $\omega_c(|\hat{g}|^2)$, $0 < \omega_0 < \omega_c(|\hat{g}|^2) < \omega_1 < \infty$. Recalling the
dilation property of the Fourier transform,

$$f(ax) \leftrightarrow a^{-1} \hat{f}(a^{-1} \omega),$$

we see that supp$(\hat{g}_{mn}) = [a^n \omega_0, a^n \omega_1] \cup [-a^n \omega_1, -a^n \omega_0]$, $\omega_c(|\hat{g}_{mn}|^2) = a^n \omega_c(|\hat{g}|^2)$, and that $g_{mn}$ is
concentrated about the point $a^{-n} mb$ with $\epsilon\text{-supp}(g_{mn}) = [a^{-n}(-L + mb), a^{-n}(L + mb)]$. Hence
if we could write an expansion of any $f \in L^2(\mathbb{R})$ as

$$f = \sum_{m,n} c_{mn}(f) \, g_{mn} \tag{8}$$

then each coefficient $c_{mn}(f)$ provides information about the frequency content of $f$ in the
frequency range $\omega \in [a^n \omega_0, a^n \omega_1] \cup [-a^n \omega_1, -a^n \omega_0]$ during the time interval
$[a^{-n}(-L + mb), a^{-n}(L + mb)]$ about $x_c(g_{mn}) = a^{-n} mb$.
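The bookkeeping behind this time-frequency picture can be written out directly. The following sketch builds $g_{mn}$ from a given $g$ as in (7) and returns the time interval and positive-frequency band associated with the index pair $(m, n)$. The function names and the numerical values in the usage note are illustrative assumptions.

```python
def dilate_translate(g, m, n, a, b):
    """Return g_mn(x) = a^(n/2) * g(a^n * x - m*b), as in (7)."""
    scale = a ** n
    amp = a ** (n / 2)
    return lambda x: amp * g(scale * x - m * b)

def wavelet_cell(m, n, a, b, L, w0, w1):
    """Time interval and positive-frequency band associated with g_mn.

    L is the half-width of eps-supp(g); [w0, w1] is the positive part of
    the support of the Fourier transform of g.
    """
    scale = a ** n
    time_interval = ((-L + m * b) / scale, (L + m * b) / scale)
    freq_band = (scale * w0, scale * w1)
    return time_interval, freq_band
```

For example, with `a = 2.0`, `b = 1.0`, `L = 2.0`, `w0 = 0.5`, `w1 = 1.5`, the call `wavelet_cell(3, 2, 2.0, 1.0, 2.0, 0.5, 1.5)` gives the time interval `(0.25, 1.25)` and the frequency band `(2.0, 6.0)`: increasing `n` narrows the time window and raises the band by the same factor $a^n$.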

Discrete affine wavelet transforms provide a framework within which it is possible to
understand expansions of the form given in (8). In a general setting, discrete affine wavelet transforms
are based upon the fact that it is possible to construct frames for $L^2(\mathbb{R})$ using translates and
dilates of a single function. That is, for certain functions $g$ it is possible to determine a dilation
stepsize $a$ and a translation stepsize $b$ such that the sequence $\{g_{mn}\}$ as defined by (7) is a frame$^2$
for $L^2(\mathbb{R})$. In this case (8) is referred to as the wavelet expansion of $f$. To form an affine frame
the mother wavelet$^3$ $g$ must satisfy an admissibility condition,

$$\int_{\mathbb{R}} \frac{|\hat{g}(\omega)|^2}{|\omega|} \, d\omega < \infty. \tag{9}$$

For a function $g$ with adequate decay at infinity, (9) is equivalent to the requirement
$\int g(x) \, dx = 0$ (see [6]). Since $\hat{g}(0) = \int g(x) \, dx$, admissibility (for functions with adequate decay)
is equivalent to requiring that $\hat{g}(0) = 0$. Furthermore $g \in L^2(\mathbb{R})$ together with admissibility
implies that $g$ must have certain approximate ‘bandpass’ characteristics.

Remarks 

• The term discrete affine wavelet transform is derived from the fact that the functions
$g_{mn}$ are generated via sampling of the continuous orbit of the left regular representation
of the affine ($ax + b$) group associated to the function $g$. A review of the implications of
group representation theory in wavelet transforms is given in [9].

• Windowed Fourier transforms (of which the Gabor transform [7] [23] is a special case) are
obtained via a representation of the group of translations and complex modulations (the
Weyl-Heisenberg group) on $L^2(\mathbb{R})$. An essential difference between windowed Fourier
transforms and affine wavelet transforms arises due to the particular group action
involved. For windowed Fourier transforms, the window size remains constant as higher
frequencies are analyzed using complex modulations. In affine wavelet transforms the
higher frequencies are analyzed over narrower windows due to the dilations, thereby
providing a mechanism for ‘zooming’ in on fine details of a signal.

$^2$ In this case we say that the triplet $(g, a, b)$ generates an affine frame for $L^2(\mathbb{R})$.

$^3$ Also referred to as the fiducial vector or analyzing waveform.
4 Dilations and Translations in SISO Neural Networks


In this section we shall demonstrate how affine wavelet decompositions$^4$ of $L^2(\mathbb{R})$ can be
implemented within the architecture of SISO feedforward neural networks. Consider the
single-input-single-output (SISO) feedforward neural network shown in Figure 6. Input and output layers of
this network each consist of a single node, whose activation function is linear with unity gain.


Figure 6: SISO feedforward neural network

In addition, the network has a single hidden layer with $N$ nodes, each with activation function
$g(\cdot)$. Hence the output of this network is given by

$$y = f(x) = \sum_{j=1}^{N} w_{j,N+1} \, g(w_{0,j} x - I_j), \tag{10}$$

where we have labeled the input node 0 and the output node $N + 1$. It is clear that (10) is of
the form in (8) with two key differences: (i) the summation in (10) is finite, and (ii) even if
we permit infinitely many hidden layer nodes, and let $g_j = g(w_{0,j} x - I_j)$, the infinite sequence
$\{g_j\}$ will not necessarily be a frame. Since it is our intent to stay within the general framework
of feedforward neural networks, let us first consider the sigmoidal function $s(x) = (1 + e^{-x})^{-1}$
shown in Figure 3 as a possible mother wavelet candidate. Since $s \notin L^2(\mathbb{R})$, it is impossible to
construct a frame for $L^2(\mathbb{R})$ using individual translated and dilated sigmoids as frame elements.
However, we note that the difference of two translated sigmoids is in $L^2(\mathbb{R})$ for finite translations
$^4$ Throughout the rest of this paper we will use the term wavelet transform to mean discrete affine wavelet
transform unless otherwise indicated.
and that in general if we let

$$\psi(x) = \sum_{n=1}^{M} s_{a_n b_n}(x) - \sum_{n=1}^{M} s_{c_n d_n}(x) \tag{11}$$

where $M < \infty$ and $s_{ab}(x) = s(ax - b)$, $a, b < \infty$, then $\psi \in L^2(\mathbb{R})$. With this observation, we
show that it is possible to construct frames using combinations of sigmoids as in (11).

4.1  Frames  From  Sigmoids 

Let $s(x) = (1 + e^{-qx})^{-1}$, where $q > 0$ is a constant which controls the ‘slope’ of the sigmoid.
To obtain a function in $L^2(\mathbb{R})$, we combine two sigmoids as in (11). Let

$$\varphi(x) = s(x + d) - s(x - d), \quad 0 < d < \infty. \tag{12}$$

So, $\varphi(\cdot)$ (see Figure 7) is an even function which decays exponentially away from the origin.


Figure 7: $\varphi(x)$ - Sum of two sigmoids, and the magnitude of its Fourier transform (time axis in seconds)


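Now, let

$$\psi(x) = \varphi(x + p) - \varphi(x - p). \tag{13}$$

$\psi(\cdot)$ (Figure 8) is an odd function, with $\int \psi(x) \, dx = 0$, which is dominated by a decaying
exponential, and it can be shown that $\psi$ satisfies the admissibility condition (9). The Fourier
transform of $\varphi$ is given by

$$\hat{\varphi}(\omega) = \int_{\mathbb{R}} \varphi(x) e^{-i \omega x} \, dx = \frac{2\pi}{q} \, \frac{\sin(\omega d)}{\sinh(\pi \omega / q)}, \tag{14}$$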
Figure 8: $\psi(x)$ - Mother wavelet candidate constructed from sigmoids and magnitude of Fourier
transform of $\psi$ (axes: time in seconds, log frequency in Hz)


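which is shown in Figure 7. Therefore the Fourier transform of $\psi$ is,

$$\hat{\psi}(\omega) = e^{i p \omega} \hat{\varphi}(\omega) - e^{-i p \omega} \hat{\varphi}(\omega) = 2i \sin(p \omega) \, \hat{\varphi}(\omega) = i \, \frac{4\pi}{q} \, \frac{\sin(p \omega) \sin(d \omega)}{\sinh(\pi \omega / q)}, \tag{15}$$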

which is shown in Figure 8. Note that the function $\psi$ is reasonably well localized in both the
time and frequency domains. For the moment, we will concentrate on the specific case where
$p = 1$, $d = 1$, and $q = 2$ (which is the case used for the plots shown in Figures 7-8). Table 1
lists some relevant parameters describing the (numerically determined) localization properties
of $\psi$. For this choice of $(p, d, q)$ (and in general whenever $p = d$) $\psi$ is a linear combination of
three sigmoids, $\psi(x) = s(x + 2) - 2 s(x) + s(x - 2)$. Figure 9 shows the implementation of $\psi$ in
a feedforward network.
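The construction in (12)-(14) can be checked numerically. The sketch below assumes the parameter choice $(p, d, q) = (1, 1, 2)$ from the text; the function names, the integration window `half_width` and the grid size are illustrative assumptions. It compares $\psi$ with the three-sigmoid form and compares the closed form (14) with a trapezoid-rule estimate of the Fourier integral.

```python
import math

def s(x, q=2.0):
    """Logistic sigmoid with slope parameter q."""
    return 1.0 / (1.0 + math.exp(-q * x))

def phi(x, d=1.0, q=2.0):
    """phi(x) = s(x + d) - s(x - d), as in (12)."""
    return s(x + d, q) - s(x - d, q)

def psi(x, p=1.0, d=1.0, q=2.0):
    """psi(x) = phi(x + p) - phi(x - p), as in (13)."""
    return phi(x + p, d, q) - phi(x - p, d, q)

def psi_three_sigmoids(x, q=2.0):
    """Closed form for p = d = 1: s(x + 2) - 2 s(x) + s(x - 2)."""
    return s(x + 2, q) - 2 * s(x, q) + s(x - 2, q)

def phi_hat(w, d=1.0, q=2.0):
    """Fourier transform of phi from (14), for w != 0."""
    return (2 * math.pi / q) * math.sin(w * d) / math.sinh(math.pi * w / q)

def phi_hat_numeric(w, d=1.0, q=2.0, half_width=30.0, n=60001):
    """Trapezoid-rule estimate of the same transform.

    phi is even, so only the cosine part of exp(-i*w*x) contributes.
    """
    dx = 2 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        x = -half_width + i * dx
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * phi(x, d, q) * math.cos(w * x)
    return total * dx
```

Evaluating `psi(x)` and `psi_three_sigmoids(x)` at a few points shows that they coincide, `psi(-x)` equals `-psi(x)` (so $\int \psi = 0$), and `phi_hat(1.0)` agrees with `phi_hat_numeric(1.0)` to within the accuracy of the quadrature.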


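    Quantity                                          Value
    $\epsilon$ (time)                                 0.1
    $\epsilon$ (frequency)                            0.1
    $x_c(\psi)$                                       0.0
    $\omega_c(|\hat{\psi}|^2)$                        0.9420
    $\epsilon\text{-supp}(\psi, \epsilon)$            [-2.15, 2.15]
    $\epsilon\text{-supp}(|\hat{\psi}|^2, \epsilon)$  [0.2920, 1.5920]

Table 1: Time-frequency localization properties of $\psi$ for $(p, d, q) = (1, 1, 2)$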

It is our goal to construct a frame for $L^2(\mathbb{R})$ using $\psi$ as the mother wavelet. That is, we
wish to find, if possible, a dilation stepsize $a$ and a translation stepsize $b$ such that the triplet


Figure 9: Feedforward network implementation of $\psi$

$(\psi, a, b)$ generates an affine frame for $L^2(\mathbb{R})$. Recall that we say $(\psi, a, b)$ generates an affine
frame for $L^2(\mathbb{R})$ if the sequence $\{\psi_{mn}\}$ is a frame for $L^2(\mathbb{R})$, where $\psi_{mn}(x) = a^{n/2} \psi(a^n x - mb)$.
For the mother wavelet $\psi$ constructed from sigmoids as above, it is possible to determine values
of $a$ and $b$ for which $(\psi, a, b)$ generates a frame for $L^2(\mathbb{R})$ (see Appendix A).

It  follows  that  we  have  constructively  proved  the  following  analysis  result. 

Theorem 4.1 Feedforward neural networks with sigmoidal activation functions and a single
hidden layer can represent any function $f \in L^2(\mathbb{R})$. Moreover, given $f \in L^2(\mathbb{R})$, all weights
in the network are determined by the wavelet expansion of $f$,

$$f(x) = \sum_{m,n} \langle f, S^{-1} \psi_{mn} \rangle \, \psi_{mn}(x).$$


Remarks: 

(a)  In  this  section  we  have  concentrated  on  wavelets  constructed  from  sigmoids.  We  would 
however,  like  to  point  out  that  nonsigmoidal  activation  functions  are  also  of  considerable 
interest  and  we  refer  the  reader  to  [24].  The  techniques  of  wavelet  theory  should  be 
applicable  to  such  activation  functions  also. 

(b) Among other activation functions used in neural networks is the discontinuous sigmoid
(step) function. Note that using such a step function together with the methods of this
section results in a mother wavelet $\psi$ which is the Haar wavelet. Dilates and translates of
the Haar function generate an orthonormal basis for $L^2(\mathbb{R})$. The Haar transform is the
earliest known example of an affine wavelet transform.
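Remark (b) can be illustrated by repeating the construction of (12)-(13) with a unit step in place of the sigmoid. The sketch below assumes $p = d = 1/4$, chosen here only so that the result has unit-length support; with these values the constructed function coincides with the Haar function translated by $1/2$. The function names are illustrative.

```python
def step(x):
    """Discontinuous sigmoid (unit step)."""
    return 1.0 if x >= 0 else 0.0

def psi_step(x, p=0.25, d=0.25):
    """Construction of (12)-(13) with the sigmoid replaced by a step."""
    phi = lambda t: step(t + d) - step(t - d)   # indicator of [-d, d)
    return phi(x + p) - phi(x - p)

def haar(x):
    """Haar function: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1:
        return -1.0
    return 0.0
```

With these definitions, `psi_step(x)` equals `haar(x + 0.5)` for every `x`; other values of `p = d` give dilated versions of the same shape.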

4.2 Wavelets For $L^2(\mathbb{R}^n)$ Constructed From Sigmoids

Although we shall primarily restrict attention to the one-dimensional setting ($L^2(\mathbb{R})$), wavelets
for higher dimensional domains ($L^2(\mathbb{R}^n)$) can also be constructed within the standard
feedforward network setting with sigmoidal activation functions. In applications such as image
processing it is desirable to use wavelets which exhibit orientation selectivity as well as
spatio-spectral selectivity. In the setting of Multiresolution Analysis [17] for example, wavelet bases
for $L^2(\mathbb{R}^2)$ are constructed using tensor products of wavelets for $L^2(\mathbb{R})$ and the corresponding
‘smoothing’ functions. This method results in three mother wavelets for $L^2(\mathbb{R}^2)$ each with a
particular orientation selectivity. However, neural network applications do not necessarily
require such orientation selective wavelets. In this case, it is possible to use translates and dilates
of a single ‘isotropic’ function to generate wavelet bases or frames for $L^2(\mathbb{R}^n)$ (c.f. [16]).
Figure 10 shows both an isotropic mother wavelet and an orientation selective mother wavelet for
$L^2(\mathbb{R}^2)$ which are implemented in a standard feedforward neural network architecture with
sigmoidal activation functions. The wavelets of Figure 10 are implemented by taking differences
of ‘bump’ functions which are generated using a construction given by Cybenko in [1].


Figure 10: Two-dimensional wavelets constructed from sigmoids: (a) Isotropic wavelet,
(b) Orientation selective wavelet.


5  Synthesis  of  Feedforward  Neural  Networks  Using  Wavelets 

In the last section, it was shown that it is possible to construct an affine frame for $L^2(\mathbb{R})$ using
a function $\psi$ which is a linear combination of three sigmoidal functions. In this section, we
shall examine some implications of the wavelet formalism for functional approximation based
on sigmoids, in the synthesis of feedforward neural networks. As was described in Section 2.1,
sigmoidal functions have served as the basis for functional approximation by feedforward neural
networks. However, in the absence of an adequate theoretical framework, topological definitions
of feedforward neural networks have for the most part been trial-and-error constructions. We
will demonstrate, by means of the simple network discussed in Section 4, how it is possible
to incorporate the joint time and frequency domain characteristics of any given approximation
problem into the initial network configuration.

Let $f \in L^2(\mathbb{R})$ be the function which we are trying to approximate. In other words, we are
provided a set $\Theta$ of sample input-output pairs under the mapping $f$,

$$\Theta = \{(x^i, y^i) : y^i = f(x^i);\ x^i, y^i \in \mathbb{R}\},$$
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and  we  would  like  to  obtain  a  good  approximation  of  /.  To  perform  the  approximation  using 
a  neural  network,  the  first  step  is  to  decide  on  a  network  configuration.  For  this  problem,  it 
is  clear  that  the  input  and  output  layers  must  each  consist  of  a  single  node.  The  remaining 
questions  are  how  many  hidden  layers  should  we  use  and  how  many  nodes  should  there  be  in 
each  hidden  layer.  These  questions  can  be  addressed  using  the  wavelet  formulation  of  the  last 
section. We consider a network of the form in Figure 6, i.e. with a single hidden layer. At this point, a traditional approach would entail fixing the number of nodes, $N$, in the hidden layer and then applying a learning algorithm such as backpropagation (described in Section 2.1) to adjust the three sets of weights: the input weights, the output weights, and the biases $\{I_j\}$. We would like to use information contained in the training set $\Theta$ to (1) decide on the number of nodes in the hidden layer, and (2) reduce the number of weights that need to be adjusted by the learning algorithm.

Here we describe two possible schemes for use of the wavelet transform formulation in the synthesis of feedforward networks. The first scheme captures the essence of how time-frequency localization can be utilized in the synthesis procedure. However, this scheme is difficult to implement when considering high-dimensional mappings and in most cases will result in a network that is far larger than necessary. We also outline a second method which further utilizes the time-frequency localization offered by wavelets to reduce the size of the network. This second method is conceivably a more viable option in the case of higher-dimensional mappings.

5.1  Network  Synthesis:  Method  I 

Assume $f$, the function which we are trying to approximate, is such that $\epsilon\text{-supp}(|\hat{f}|^2, \tilde{\epsilon}) = [\omega_{\min}, \omega_{\max}]$, where $\omega_{\min} > 0$.$^5$ Also assume that there exists a finite interval $[x_{\min}, x_{\max}]$ in which we wish to approximate $f$. Our network synthesis procedure is described in algorithmic form below.


Synthesis  Algorithm: 

Step I Our first step is to perform a frequency analysis of the training data. In this step we wish to obtain an estimate of the ‘bandwidth’ $\epsilon\text{-supp}(|\hat{f}|^2, \tilde{\epsilon})$ of $f$ based on the samples of $f$ provided in $\Theta$. A number of techniques can be considered for performing this estimate. We will not elaborate on such techniques here. Let $\tilde{\omega}_{\min}$ be our estimate of $\omega_{\min}$ and $\tilde{\omega}_{\max}$ be our estimate of $\omega_{\max}$.
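Since the text leaves the estimation technique open, the following is only one possible illustration of Step I, not the authors' procedure. It assumes the irregular samples can be linearly interpolated onto a uniform grid, removes the mean (consistent with a band that excludes zero frequency), and returns the frequency band holding roughly a fraction 1 - eps of the spectral energy. The function name `estimate_band` and its parameters are hypothetical, and frequencies are reported in cycles per unit of x rather than in angular units.

```python
import numpy as np

def estimate_band(x, y, eps=0.05, n_grid=256):
    """Return (f_lo, f_hi), in cycles per unit x, holding about 1 - eps of the energy.

    Assumption: linear interpolation onto a uniform grid is accurate enough
    for the sampled function; this is an illustrative choice only.
    """
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]

    # Resample the irregular samples on a uniform grid covering the data.
    grid = np.linspace(x[0], x[-1], n_grid, endpoint=False)
    resampled = np.interp(grid, x, y)
    resampled = resampled - resampled.mean()   # discard the zero-frequency term

    # One-sided power spectrum and its normalised cumulative sum.
    power = np.abs(np.fft.rfft(resampled)) ** 2
    freqs = np.fft.rfftfreq(n_grid, d=grid[1] - grid[0])
    cumulative = np.cumsum(power) / power.sum()

    # Trim eps/2 of the energy from each end of the spectrum.
    f_lo = freqs[np.searchsorted(cumulative, eps / 2)]
    f_hi = freqs[np.searchsorted(cumulative, 1 - eps / 2)]
    return f_lo, f_hi
```

For samples of a single sinusoid, the returned band should bracket the sinusoid's frequency; multiplying by 2*pi converts the result to angular frequency if that is the convention in use.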

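Step II We now use the knowledge of $\tilde{\omega}_{\min}$, $\tilde{\omega}_{\max}$, $x_{\min}$, and $x_{\max}$ to choose the particular frame elements to be used in the approximation. The main idea in this step is to choose only those elements of the frame $\{\psi_{mn}\}$ which ‘cover’ the region $Q_f$ of the time-frequency plane defined by

$$Q_f(\epsilon, \tilde{\epsilon}) = [x_{\min}, x_{\max}] \times \big([\tilde{\omega}_{\min}, \tilde{\omega}_{\max}] \cup [-\tilde{\omega}_{\max}, -\tilde{\omega}_{\min}]\big),$$

which represents the concentration of $f$ in time and frequency as determined from the data $\Theta$. Recall that $\epsilon\text{-supp}(|\hat{\psi}|^2, \tilde{\epsilon}) = [\omega_0(\psi), \omega_1(\psi)]$ and $\epsilon\text{-supp}(\psi, \epsilon) = [x_0(\psi), x_1(\psi)]$ (see Table 1).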

$^5$ Since $f$ is real-valued, we need only consider positive frequencies.

Thus the concentration of the mother wavelet $\psi$ in the time-frequency plane is in the region $[x_0(\psi), x_1(\psi)] \times [\omega_0(\psi), \omega_1(\psi)]$. Hence the concentration of $\psi_{mn}$ in the time-frequency plane is

$$Q_{mn}(\epsilon, \tilde{\epsilon}) = [a^{-n}(x_0(\psi) + mb),\ a^{-n}(x_1(\psi) + mb)] \times \big([a^n\omega_0(\psi),\ a^n\omega_1(\psi)] \cup [-a^n\omega_1(\psi),\ -a^n\omega_0(\psi)]\big),$$

which is centered at $(x_c(\psi_{mn}), \omega_c(|\hat{\psi}_{mn}|^2)) = (a^{-n}(x_c(\psi) + mb),\ a^n\omega_c(|\hat{\psi}|^2))$. Figure 11 shows the location of $Q_f$ and the $Q_{mn}$'s together with the time-frequency concentration centers $(x_c(\psi_{mn}), \omega_c(|\hat{\psi}_{mn}|^2))$ of the frame elements.

Figure 11: Time-frequency concentrations $Q_f$ and $Q_{mn}$'s together with time-frequency concentration centers $(x_c(\psi_{mn}), \omega_c(|\hat{\psi}_{mn}|^2))$ of the frame elements.

Therefore, to ‘cover’ $Q_f(\epsilon, \tilde{\epsilon})$ we need to determine the index set $\mathcal{I}$ of pairs $(m, n)$ of integer translation and dilation indices such that

$$Q_{mn} \cap Q_f \neq \emptyset \quad \text{for } (m, n) \in \mathcal{I}.$$

Daubechies in [6] discusses the existence of a bounding box $B_\epsilon$ surrounding the time-frequency concentration $Q_f$ of $f$ such that $f$ can be approximated to any desired precision $\epsilon$ by including in the approximation all frame elements with concentration centers in $B_\epsilon$.
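The covering condition of Step II can be sketched as a short enumeration. This is a minimal illustration under stated assumptions: the concentration intervals of the mother wavelet and of $f$ are supplied by the caller, only the positive-frequency band is tested (the negative band is its mirror image), and both lower frequency edges are strictly positive. The function name `covering_index_set` and its argument names are hypothetical.

```python
import math

def covering_index_set(a, b, psi_x, psi_w, f_x, f_w):
    """List the (m, n) whose rectangle Q_mn intersects Q_f.

    psi_x = (x0, x1), psi_w = (w0, w1): time and frequency concentration of psi.
    f_x = (xmin, xmax), f_w = (wmin, wmax): time and frequency concentration of f.
    Assumes a > 1, b > 0, w0 > 0 and wmin > 0.
    """
    x0, x1 = psi_x
    w0, w1 = psi_w
    xmin, xmax = f_x
    wmin, wmax = f_w

    # [a^n w0, a^n w1] meets [wmin, wmax] iff a^n w1 >= wmin and a^n w0 <= wmax.
    n_lo = math.ceil(math.log(wmin / w1, a))
    n_hi = math.floor(math.log(wmax / w0, a))

    index_set = []
    for n in range(n_lo, n_hi + 1):
        scale = a ** n
        # [a^-n (x0 + m b), a^-n (x1 + m b)] meets [xmin, xmax] iff
        # x1 + m b >= a^n xmin and x0 + m b <= a^n xmax.
        m_lo = math.ceil((scale * xmin - x1) / b)
        m_hi = math.floor((scale * xmax - x0) / b)
        index_set.extend((m, n) for m in range(m_lo, m_hi + 1))
    return index_set
```

The length of the returned list is the number of hidden nodes that Method I would allocate; note how the number of translations grows with the dilation index, which is the source of the large networks mentioned earlier.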

Step III Given $\mathcal{I}$, it is now possible to configure the network. From the manner in which $\mathcal{I}$ is defined, we expect to be able to obtain an approximation to $f$ of the form

$$f(x) \approx \sum_{(m,n) \in \mathcal{I}} c_{mn}\psi_{mn}(x) = \tilde{f}(x) \tag{16}$$

for $x \in [x_{\min}, x_{\max}]$. The approximation error in (16) can be made arbitrarily small by allowing $\epsilon$ and $\tilde{\epsilon}$ to go to zero in the computation of the various $\epsilon$-supports used to define the sets $Q_f$ and $Q_{mn}$. This is because we know that $\{\psi_{mn}\}$ is a frame and therefore it is possible to write $f$ as

$$f = \sum_{m,n \in \mathbb{Z}} c_{mn}(f)\,\psi_{mn} \tag{17}$$

for some coefficients $\{c_{mn}(f)\}$. Returning to the single-hidden-layer feedforward network shown in Figure 6, choose the number of nodes in the hidden layer to be equal to the number of elements in $\mathcal{I}$, i.e. $N = \#(\mathcal{I})$, where the activation function of each node is taken to be$^6$ $\psi$. Now if we set the weights from the input node to the hidden layer and the biases on each hidden layer node to the dilation and translation coefficients indexed by $(m, n) \in \mathcal{I}$, then the output of the network can be written as

$$y = \sum_{(m,n) \in \mathcal{I}} c_{mn}\psi_{mn}(x) \tag{18}$$

where $x$ is the input of the network and the $c_{mn}$'s are the weights from the hidden layer to the output node. We have therefore obtained a network configuration which defines an output function (18) that is exactly of the form required to approximate the function $f$ (Equation (16)).

It remains to determine the coefficients $c_{mn}$ in (18) that will result in the desired approximation.
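The network of Equation (18) can be sketched as a fixed hidden layer followed by a linear output node. The wavelet below is an assumed stand-in, a second difference of three sigmoids, and is not necessarily the $\psi$ constructed in Section 4; the normalisation $a^{n/2}\psi(a^n x - mb)$ and the default values of `a` and `b` are likewise assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def psi(x, d=1.0):
    # Assumed stand-in wavelet: a combination of three sigmoids (not the paper's exact psi).
    return sigmoid(x + d) - 2.0 * sigmoid(x) + sigmoid(x - d)

def hidden_layer(x, index_set, a=2.0, b=1.0):
    """Outputs of the hidden nodes; column k is psi_mn for the k-th (m, n) pair."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    m = np.array([pair[0] for pair in index_set], dtype=float)
    n = np.array([pair[1] for pair in index_set], dtype=float)
    scale = a ** n                      # fixed input weight of each node (dilation)
    shift = m * b                       # fixed bias of each node (translation)
    return np.sqrt(scale) * psi(scale * x - shift)

def network_output(x, coeffs, index_set, a=2.0, b=1.0):
    """y = sum over (m, n) of c_mn * psi_mn(x): the only trainable weights are coeffs."""
    return hidden_layer(x, index_set, a, b) @ np.asarray(coeffs, dtype=float)
```

Because the dilations and translations are fixed by the index set, the output is linear in `coeffs`, which is the property exploited when the coefficients are computed.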

5.2  Network  Synthesis:  Method  II 

The synthesis algorithm described above in Section 5.1 uses identification of an ‘important’ region $Q_f$ of the time-frequency plane. Critical to identification of this region is the ‘bandwidth’ estimate made in Step I. There are two significant drawbacks of making such a bandwidth estimate:

(1)  Estimation  of  spectral  concentration  of  signals  in  high  dimensions  is  computationally 
expensive. 

(2)  Any  estimate  of  spectral  concentration  which  relies  on  Fourier  techniques  is  going  to 
generate  a  generalized  rectangle  in  joint  time-frequency  space.  For  many  functions  such  a 
rectangular  concentration  in  time-frequency  is  simply  an  artifact  of  the  spatial  nonlocality 
of  the  Fourier  basis.  For  example,  an  estimate  of  the  frequency  concentration  of  the 
signal  in  Figure  5  will  generate  a  rectangle  in  time-frequency  as  the  concentration  of  the 
signal.  If  we  then  use  this  rectangle  to  choose  which  elements  of  a  wavelet  basis  to  use 
to  approximate  the  signal,  the  time-frequency  rectangle  will  dictate  that  large  dilations 
(corresponding  to  high  frequencies)  of  the  wavelets  be  used  over  the  entire  time  interval. 
However,  since  each  wavelet  is  also  localized  in  time,  and  high  frequency  components  of 
the  signal  are  localized  as  well,  this  is  clearly  an  excessive  number  of  wavelets.  Large 
dilations  can  be  used  locally  where  needed. 

Spatio-spectral localization properties of wavelets can be further exploited to reduce the number of network nodes (wavelets) used in the approximation. The basic idea is that since wavelets are well-suited to identify spatially local regions of fine scale (high frequency) features in a signal, the locations and values of local maxima of the wavelet approximation coefficients at one scale (dilation) indicate whether or not it is necessary to locally refine the approximation by the use of wavelets at finer scales (cf. [18]). A network synthesis algorithm using this idea would be an adaptive procedure of the following form.

$^6$ Recall that $\psi$ is a linear combination of three sigmoids.

(1) Construct and train a network to approximate the mapping at some scale $a^n$ over the entire spatial region of interest.

(2) Identify local maxima of the wavelet coefficients and locally refine the approximation by adding new dilations (nodes) to the network where needed.

(3) Repeat (2) until some stopping criterion has been satisfied.

Using a scheme such as this would result in approximations being performed over regions of time-frequency of the form shown in Figure 12. Some aspects of this scheme are discussed in [22].

Figure 12: Form of time-frequency coverage from approximation scheme of Section 5.2.
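The three steps of Method II might be sketched as follows. This is a rough illustration rather than the scheme of [22]: training is replaced by a least-squares fit, the wavelet is an assumed three-sigmoid stand-in, the stopping criterion is simply a maximum dilation index, and the names `adaptive_index_set`, `rel_threshold`, and `spread` are hypothetical.

```python
import numpy as np

def psi(u, d=1.0):
    # Assumed stand-in wavelet built from three sigmoids (not the paper's exact psi).
    s = lambda t: 1.0 / (1.0 + np.exp(-t))
    return s(u + d) - 2.0 * s(u) + s(u - d)

def design(x, index_set, a, b):
    # Column k holds psi_mn evaluated at the samples, for the k-th (m, n) pair.
    return np.stack([a ** (n / 2) * psi(a ** n * x - m * b) for m, n in index_set], axis=1)

def local_maxima(ms, mags, rel_threshold):
    """Translations whose |coefficient| is a local maximum above a relative threshold."""
    peaks = []
    for i, m in enumerate(ms):
        left = mags[i - 1] if i > 0 else -np.inf
        right = mags[i + 1] if i + 1 < len(ms) else -np.inf
        if mags[i] >= left and mags[i] >= right and mags[i] >= rel_threshold * mags.max():
            peaks.append(m)
    return peaks

def adaptive_index_set(x, y, n0, n_max, a=2.0, b=1.0, rel_threshold=0.5, spread=2):
    # Step (1): cover the whole interval at the coarse scale n0.
    m_lo = int(np.floor(a ** n0 * x.min() / b)) - spread
    m_hi = int(np.ceil(a ** n0 * x.max() / b)) + spread
    index_set = [(m, n0) for m in range(m_lo, m_hi + 1)]
    # Steps (2)-(3): refine around coefficient maxima until n_max is reached.
    for n in range(n0, n_max):
        coeffs, *_ = np.linalg.lstsq(design(x, index_set, a, b), y, rcond=None)
        ms = sorted(m for m, k in index_set if k == n)
        mags = np.array([abs(coeffs[index_set.index((m, n))]) for m in ms])
        new = set()
        for m in local_maxima(ms, mags, rel_threshold):
            centre = int(round(a * m))   # same location, one scale finer
            new.update((centre + j, n + 1) for j in range(-spread, spread + 1))
        if not new:
            break
        index_set += sorted(new)
    return index_set
```

Only a few translations are added at each finer scale, so the node count grows with the number of localized features rather than with the full time-frequency rectangle.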

5.3  Computation  of  Coefficients 

In the case of an infinite expansion via frame elements, there exists (at least in theory) a method of determining the expansion coefficients in terms of the inverse of the frame operator $S$ defined in (3). From (5), we see that given the frame $\{\psi_{mn}\}$, the coefficients in (17) are given by

$$c_{mn} = \langle f, S^{-1}\psi_{mn} \rangle. \tag{19}$$

From Theorem 3.1, we see that in principle $S^{-1}\psi_{mn}$ can be computed from the series expansion given in (4). However, the rate of convergence of this series is governed by how close the frame is to being a tight frame, i.e. by how close the ratio $B/A$ is to 1. So for ‘loose’ frames explicit computation of wavelet expansion coefficients may prove overly demanding of computational resources.
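The dependence of the convergence rate on $B/A$ can be seen in a small finite-dimensional example. The sketch below uses the standard Neumann-series form $S^{-1} = \frac{2}{A+B}\sum_{k \ge 0}\big(I - \frac{2}{A+B}S\big)^k$, which is assumed here and may differ in detail from the expansion in (4); the three-vector frame in $\mathbb{R}^2$ is purely illustrative.

```python
import numpy as np

# Three frame vectors in R^2 (rows); an illustrative, non-tight frame.
frame = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Frame operator S v = sum_k <v, h_k> h_k, written as a matrix.
S = frame.T @ frame
# Optimal frame bounds are the extreme eigenvalues of S (here A = 1, B = 3).
A, B = np.linalg.eigvalsh(S)[[0, -1]]

def apply_S_inverse(v, terms=80):
    """Approximate S^{-1} v by a truncated Neumann series.

    The error shrinks like ((B - A) / (B + A)) ** terms, so convergence is
    fast when B / A is close to 1 and slow for 'loose' frames.
    """
    R = np.eye(len(v)) - (2.0 / (A + B)) * S
    total = np.zeros(len(v))
    term = np.asarray(v, dtype=float)
    for _ in range(terms):
        total = total + term
        term = R @ term
    return (2.0 / (A + B)) * total
```

With these helpers the expansion coefficients of a vector `v` are `frame @ apply_S_inverse(v)`, since $\langle v, S^{-1}h_k \rangle = \langle S^{-1}v, h_k \rangle$ for the self-adjoint operator $S$.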

Considering now the case of a finite approximation to $f$ as in (16), let $\mathrm{Span}\{h_n\}$ denote the closed linear span of the vectors $\{h_n\}$. It is clear that $f$ can be represented exactly by the expansion in (16) if and only if $f \in \mathrm{Span}\{\psi_{mn}, (m, n) \in \mathcal{I}\}$. If $f \notin \mathrm{Span}\{\psi_{mn}, (m, n) \in \mathcal{I}\}$ then the ‘best’$^7$ approximation to $f$ in terms of the finite subset of frame elements with indices in $\mathcal{I}$ is the projection of $f$ onto $\mathrm{Span}\{\psi_{mn}, (m, n) \in \mathcal{I}\}$. In this case, we would like to compute the coefficients of expansion of the projection of $f$ onto $\mathrm{Span}\{\psi_{mn}, (m, n) \in \mathcal{I}\}$.

$^7$ With respect to the $L^2(\mathbb{R})$ norm.

5.3.1 Variational computation of wavelet coefficients based on training data


Although the problem of determining the wavelet coefficients in a finite approximation can be well formulated, we know of no analytic solution to the problem of explicitly computing the coefficients, given only (possibly irregularly spaced) samples of the function. We can, however, formulate the coefficient computation problem as a variational principle in a fashion analogous to learning algorithms such as backpropagation. We define our cost functional to be

$$E = \sum_{(x^i, y^i) \in \Theta} |O^i - y^i|^2 = \sum_{(x^i, y^i) \in \Theta} \Big| \sum_{(m,n) \in \mathcal{I}} c_{mn}\psi_{mn}(x^i) - y^i \Big|^2, \tag{20}$$

where $O^i$ is the output of the network when $x^i$ is the input, as in Section 2.1. We choose the wavelet coefficients as those which minimize $E$. As a result of the wavelet formulation, the weights to be determined appear linearly in the output equation of the network. Thus $E$ is a convex function of the coefficients $\{c_{mn}\}$ and therefore any minimizer $c^* = \{c^*_{mn}\}_{(m,n) \in \mathcal{I}}$ is a global minimizer. Simple iterative optimization algorithms such as gradient descent can be used to minimize $E$.
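A minimal sketch of gradient descent on the cost (20) follows, assuming the hidden-layer outputs have already been collected in a matrix `Phi` with `Phi[i, k]` equal to the k-th wavelet evaluated at the i-th sample. The function name and the fixed step size (the reciprocal of the largest Hessian eigenvalue) are illustrative choices, not taken from the text.

```python
import numpy as np

def fit_by_gradient_descent(Phi, y, steps=500):
    """Minimise E(c) = ||Phi c - y||^2 by fixed-step gradient descent.

    Phi[i, k] = value of the k-th wavelet at sample x^i (assumed precomputed).
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    # The Hessian of E is 2 Phi^T Phi; its largest eigenvalue bounds the curvature.
    curvature = 2.0 * np.linalg.norm(Phi, 2) ** 2
    c = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = 2.0 * Phi.T @ (Phi @ c - y)   # gradient of E with respect to c
        c = c - grad / curvature
    return c
```

Because $E$ is a convex quadratic in the coefficients, this iteration cannot be trapped in a spurious local minimum; its speed, however, depends on the conditioning of `Phi`.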

5.3.2  Normal  Equations 

There exists, however, an alternative formulation of the above optimization problem which provides a non-iterative solution. Minimization of $E$ as defined in (20) defines a ‘least squares’ problem. Therefore solutions can be determined by solving the system of linear equations constructed via the first order optimality condition (which is both necessary and sufficient in this case) $\partial E / \partial c_{kj} = 0$, $(k, j) \in \mathcal{I}$, at any minimizer $c^*$. By choosing an ordering of the wavelet terms $\{\psi_{mn}, (m, n) \in \mathcal{I}\}$, and writing $\psi_k$ for the $k$-th term in that ordering, the normal equations can be written as

$$PC = W \tag{21}$$

where $P$ is the $\#(\mathcal{I}) \times \#(\mathcal{I})$ matrix defined by

$$P = [P_{k,l}] = \Big[ \sum_{(x^i, y^i) \in \Theta} \psi_k(x^i)\,\psi_l(x^i) \Big], \tag{22}$$

and

$$W = \Big[ \sum_{(x^i, y^i) \in \Theta} \psi_1(x^i)\,y^i,\ \ldots,\ \sum_{(x^i, y^i) \in \Theta} \psi_{\#(\mathcal{I})}(x^i)\,y^i \Big]^T, \tag{23}$$

and $C$ is the coefficient vector which needs to be solved for. Typically solutions of (21) will not be unique and stabilizing methods such as use of the generalized inverse, $P^\dagger = (P^*P)^{-1}P^*$, must be applied.
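The normal equations (21) can be assembled and solved in a few lines. This sketch assumes the wavelet values at the samples are stored in a matrix `Phi` (one column per wavelet term) and uses NumPy's Moore-Penrose pseudoinverse, which is computed through a singular value decomposition and therefore also handles a rank-deficient $P$; the function name and the cutoff `rcond=1e-10` are illustrative.

```python
import numpy as np

def solve_normal_equations(Phi, y):
    """Solve P C = W in the minimum-norm sense.

    Phi[i, k] = value of the k-th wavelet term at sample x^i (assumed precomputed).
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    P = Phi.T @ Phi     # P[k, l] = sum_i psi_k(x^i) psi_l(x^i), as in (22)
    W = Phi.T @ y       # W[k]    = sum_i psi_k(x^i) y^i,        as in (23)
    # The pseudoinverse selects the minimum l2-norm solution when P is singular.
    return np.linalg.pinv(P, rcond=1e-10) @ W
```

For instance, with two identical columns in `Phi` the solution is not unique, and the pseudoinverse splits the weight equally between the two duplicated terms, which is the minimum-norm choice.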

Remark 

Given a frame $\{\psi_{mn}\}$ and $f \in L^2(\mathbb{R})$, let $c(f)$ be the vector in $\ell^2$ defined by the wavelet expansion coefficients $\{\langle f, S^{-1}\psi_{mn} \rangle\}$ of $f$. From Theorem 3.1 (6), it is clear that if the wavelet expansion of $f \in L^2(\mathbb{R})$ is not unique, then all sequences $a(f)$ in $\ell^2(\mathbb{Z}^2)$ of wavelet expansion coefficients of $f$ must be such that $\|a(f)\|^2 = \|c(f)\|^2 + \|c(f) - a(f)\|^2$. Therefore $c(f)$ is an optimal sequence of expansion coefficients in the sense of being minimum ($\ell^2$) norm. It can easily be shown that any finite number of vectors form a frame for their span (cf. [22]). It is also well known that use of the generalized inverse, $P^\dagger$, of $P$ results in the minimum $\ell^2$ norm solution. Thus the generalized inverse $P^\dagger$ is a sensible choice for use in solving (21).

5.4  Simulations 

As a test of the neural network synthesis procedure described above, we simulated a few simple examples. As a first test we chose the bandlimited function comprised of two sinusoids at different frequencies, specifically $f(x) = \sin(2\pi 5x) + \sin(2\pi 10x)$, which is shown in Figure 13. Taking $x_{\min} = 0.0$ and $x_{\max} = 0.3$, 50 randomly spaced samples of the function were included in the training set $\Theta$. A single dilation of the mother wavelet was chosen ($n = 6$) which covered the frequency range adequately (see Figure 14). Translations$^8$ of this dilation of $\psi$ which contributed significantly in the interval $[x_{\min}, x_{\max}]$ were used, resulting in 40 hidden units. Applying a simple gradient descent scheme to minimize $E$, an approximation to $f$ was obtained. The resulting approximation is shown in Figure 13 along with the original function.

Figure 13: Original bandlimited function $f(x) = \sin(2\pi 5x) + \sin(2\pi 10x)$ and finite wavelet approximation (dashed line).

A second, slightly more complicated, example was simulated by first generating a random spectrum (Figure 15) which is concentrated in frequency and then sampling the corresponding function in the time domain. The result of this simulation using again just one dilation of the mother wavelet is shown in Figure 16.

$^8$ These translations were integer multiples of the translation stepsize $b$.

Figure 14: Wavelet at dilation $n = 6$ and magnitude of Fourier transform.

6  Conclusions  and  Discussion 

We  have  demonstrated  that  it  is  possible  to  construct  a  theoretical  description  of  feedforward 
neural  networks  in  terms  of  wavelet  decompositions.  This  description  follows  naturally  from 
the  inherent  translation  and  dilation  structure  of  such  networks.  The  wavelet  description  of 
feedforward  networks  easily  characterizes  the  class  of  mappings  which  can  be  implemented  in 
such  architectures.  Although  such  characterizations  have  been  previously  provided  in  a  number 
of  different  forms  [2,  1,  10],  to  our  knowledge,  no  previous  characterization  using  sigmoidal 
activation  functions  is  capable  of  defining  the  exact  network  implementation  of  a  given  function. 
What  is  distinctly  different  about  the  wavelet  viewpoint  is  that  it  provides  an  extremely  flexible 
(not  necessarily  orthogonal)  transform  formalism.  This  flexibility  has  been  utilized  in  this  paper 
to  construct  a  transform  based  upon  combinations  of  sigmoids.  We  would  like  to  point  out  that 
in  general  there  is  nothing  special  about  sigmoidal  functions  and  that  a  variety  of  different 
activation  functions,  including  e.g.  orthogonal  wavelets  can  be  of  significant  interest.  Sigmoidal 
functions  however  hold  one  attraction;  such  functions  can  be  easily  implemented  in  analog 
integrated  circuitry  (see  e.g. [19]).  Aside  from  this,  we  have  chosen  to  work  with  sigmoidal 
functions  only  to  demonstrate  the  general  methodology  that  can  be  applied  in  the  context  of 
feedforward  neural  networks. 

In addition to providing a theoretical framework within which to perform analysis of feedforward networks, the wavelet formalism supplies a tool which can be used to incorporate spatio-spectral information contained in the training data in the structuring of the network. Two possible schemes to perform this task were described in Section 5. Minimality in terms of the number of nodes in the network cannot be guaranteed using these methods.$^9$ However, it is possible to estimate the approximation error ([6]) in terms of the signal energy lying outside the chosen spatio-spectral region.

Figure 15: Random spectrum.

In this paper, attention has been primarily restricted to approximating functions in $L^2(\mathbb{R})$.
Most  applications  where  neural  networks  are  particularly  useful  involve  mappings  in  higher 
dimensional  domains  (e.g.  in  vision,  robot  motion  control,  etc).  Although  extensions  of  the 
methods  of  this  paper  to  higher  dimensions  are  possible  (as  described  in  Section  4.2),  such 
extensions  have  the  potential  to  be  computationally  expensive.  We  are  currently  studying  the 
formulation  of  more  computationally  viable  synthesis  techniques  for  approximation  of  higher 
dimensional  mappings  using  feedforward  neural  networks. 

Using  the  wavelet  formalism  to  synthesize  networks  results  in  a  greatly  simplified  training 
problem.  Unlike  the  situation  in  traditional  feedforward  neural  network  constructions,  the  cost 
functional  is  convex  and  thereby  admits  global  minimizing  solutions  only.  Convexity  of  the 
cost  functional  is  a  result  of  fixing  the  weights  in  the  arguments  of  the  nonlinearities  so  as  to 
provide  the  required  dilations  and  translations.  Simple  iterative  solutions  to  this  problem  such 
as  gradient  descent  are  thus  justifiable  and  are  not  in  danger  of  being  trapped  in  local  minima. 


$^9$ This problem of large networks is particularly limiting when considering mappings in higher dimensions.

Figure 16: Frequency-concentrated signal corresponding to the random spectrum in Figure 15 and finite wavelet approximation (dashed line).

Appendix 


A  Determining  Translation  and  Dilation  Stepsizes 

Given an admissible mother wavelet $g \in L^2(\mathbb{R})$, the following theorem by Daubechies [6] can be used to numerically determine values of the parameters $a$ and $b$ for which $(g, a, b)$ generates an affine frame for $L^2(\mathbb{R})$.

Theorem A.1 (Daubechies [6]) Let $g \in L^2(\mathbb{R})$ and $a > 1$ be such that:

(1)
$$m(g; a) = \operatorname*{ess\,inf}_{|\omega| \in [1, a]} \sum_{n \in \mathbb{Z}} |\hat{g}(a^n\omega)|^2 > 0, \tag{24}$$

(2)
$$M(g; a) = \operatorname*{ess\,sup}_{|\omega| \in [1, a]} \sum_{n \in \mathbb{Z}} |\hat{g}(a^n\omega)|^2 < \infty, \tag{25}$$

(3)
$$\lim_{b \to 0} \sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2} = 0, \tag{26}$$

where

$$\beta(s) = \operatorname*{ess\,sup}_{|\omega| \in [1, a]} \sum_{n \in \mathbb{Z}} |\hat{g}(a^n\omega)|\,|\hat{g}(a^n\omega - s)|.$$

Then there exists $B_c > 0$ such that $(g, a, b)$ generates an affine frame for each $0 < b < B_c$.

A proof of the following corollary can also be found in [6].

Corollary A.1 If $g \in L^2(\mathbb{R})$ and $a > 1$ satisfy the hypotheses of Theorem A.1 then

$$B_c \ge b_c = \inf\Big\{ b \ \Big|\ m(g; a) - 2\sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2} \le 0 \Big\} \tag{27}$$

and for $0 < b < b_c$, the frame bounds $A$ and $B$ can be estimated as

$$A \ge b^{-1}\Big( m(g; a) - 2\sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2} \Big),$$

$$B \le b^{-1}\Big( M(g; a) + 2\sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2} \Big). \tag{28}$$

A.1 Dilation and Translation Stepsizes for the Wavelet $\psi$ Constructed from Sigmoids

For the task of constructing an affine frame based on the mother wavelet candidate $\psi$ of Section 4.1 with dilation stepsize $a = 2$, we can check conditions (24) and (25) numerically. Figure 17 shows a plot of the sum in (24) using the mother wavelet candidate $\psi$ with dilation stepsize $a = 2$. From the plot in Figure 17 the minimum value of the sum in (24) is approximately $m(g; 2) = 7.88$, and the maximum value is $M(g; 2) = 8.01$. Figure 18 is a plot showing the value of $2\sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2}$ for various values of $b$. From this, we see that for $a = 2$ and $0 < b < 0.57$, $(\psi, a, b)$ generates an affine frame for $L^2(\mathbb{R})$.

Figure 17: Plot of $\sum_n |\hat{\psi}(a^n\omega)|^2$ for $\omega \in [1, a] = [1, 2]$.

Figure 18: Plot of $2\sum_{k=1}^{\infty} \beta(k/b)^{1/2}\,\beta(-k/b)^{1/2}$ for various values of $b$ (horizontal axis: translation stepsize $b$).
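The numerical check described above can be sketched as follows. Because the Fourier transform of the sigmoid-based $\psi$ is not reproduced here, the code substitutes a stand-in: the Fourier transform of the Mexican-hat wavelet, up to a constant factor. The numbers it produces therefore do not correspond to the values 7.88, 8.01, or 0.57 quoted in the text. The truncation ranges for the sums over $n$ and $k$, the grid resolution, and all function names are assumptions made for illustration.

```python
import numpy as np

def g_hat(w):
    # Stand-in transform (Mexican hat, up to a constant); NOT the paper's psi-hat.
    return w ** 2 * np.exp(-0.5 * w ** 2)

def band(a, n_points=801):
    # Grid over |w| in [1, a], covering both signs of w.
    w = np.linspace(1.0, a, n_points)
    return np.concatenate([w, -w])

def dilation_sum_bounds(a=2.0, n_range=range(-30, 31)):
    """Approximate m(g; a) and M(g; a) from conditions (24) and (25)."""
    w = band(a)
    total = sum(np.abs(g_hat(a ** n * w)) ** 2 for n in n_range)
    return total.min(), total.max()

def beta(s, a=2.0, n_range=range(-30, 31)):
    """Approximate beta(s) by a maximum over the grid."""
    w = band(a)
    total = sum(np.abs(g_hat(a ** n * w)) * np.abs(g_hat(a ** n * w - s)) for n in n_range)
    return total.max()

def cross_term(b, a=2.0, k_max=40):
    """Truncated value of 2 * sum_k sqrt(beta(k/b) * beta(-k/b))."""
    return 2.0 * sum(np.sqrt(beta(k / b, a) * beta(-k / b, a)) for k in range(1, k_max + 1))
```

Scanning `cross_term(b)` over increasing `b` and locating the first value at which it reaches `m` gives a numerical estimate of $b_c$ in (27) for the chosen stand-in wavelet.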

Remark 

The conditions in Theorem A.1, and subsequently those in Corollary A.1, are in general very conservative since the theorem relies on the Cauchy-Schwarz inequality to establish bounds. Therefore, although it may be possible to determine values of $a$ and $b$ for which $(g, a, b)$ generates an affine frame for a given mother wavelet $g$, it is almost certain that these are not the ‘best’ possible values of $a$ and $b$ which can be used with $g$ to obtain a frame. For very small values of $a$ and $b$, a large number of frame elements will be required to ‘cover’ any given time interval and frequency band. Thus it is desirable to use the largest values of $a$ and $b$ for which $(g, a, b)$ generates a frame. That is, we would like the frame elements to be as sparsely distributed in joint time-frequency space as possible.